flowchart LR
A[user query] --> B(system)
B --> C@{ shape: docs, label: "Knowledgebase"}
C --> B
B --> E[Response]
Retrieval Augmented Generation 101
build a RAG pipeline with few, stable dependencies
Constituents
- Parametric Memory: an LLM, typically instruction fine-tuned
- Non-Parametric Memory: any information retrieval mechanism whose source is a storage/index of information
Motivation
Large Language Models are very powerful at learning in-depth knowledge, and the patterns associated with it, from the data presented to them during pre-training and subsequent fine-tuning. While this behaviour makes them useful in many applications, it also renders them ineffective once the training data becomes obsolete (this happens for many reasons, e.g. domains with dynamic knowledge bases that are frequently updated over time). Repeatedly revising the parametric memory (by re-training on new data) is expensive.
Large Language Models are also prone to hallucinations; many causes have been postulated, and they come in many types. Several LLM-powered applications require that the user is able to trust and verify the output, and to understand which source material the LLM used to generate its response.
Intuition
Large Language Models are (as assumed) really, really good at mimicking patterns seen in their training data. BUT the paper that introduced GPT-3 to the world also showed that large language models are good at learning from context and patterns provided at inference time. Because generative pre-training improves model accuracy on Natural Language Understanding tasks, the LLM can be introduced at inference time to fairly new concepts previously unseen in its training data (parametric memory).
Applications
The synergy of these capabilities allowed “AI developers” to build applications that harness the power of LLMs while also giving them access to other data sources.
let’s build :)
Definition
A RAG system, at its core, accepts a user query and provides an answer after retrieving relevant information from an external knowledge base.
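A minimal sketch of that loop, with hypothetical retrieve and generate helpers standing in for the retrieval and generation pieces covered in the rest of this post:

```python
def retrieve(query: str, knowledge_base: list[str], k: int = 3) -> list[str]:
    """Hypothetical retriever: return the k chunks most relevant to the query."""
    ...

def generate(prompt: str) -> str:
    """Hypothetical call to an LLM completion endpoint."""
    ...

def answer(query: str, knowledge_base: list[str]) -> str:
    # non-parametric memory: fetch supporting chunks for the query
    context = retrieve(query, knowledge_base)
    context_text = "\n\n".join(context)
    # parametric memory: ask the LLM to answer grounded in those chunks
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_text}\n\nQuestion: {query}"
    )
    return generate(prompt)
```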
Useful techniques for building a RAG system
This is an introduction to the broad categories of techniques you will need in the ever-evolving process of building a RAG system.
1. Document Preprocessing
The documents that form a part of the non-parametric memory of the system need to be processed and stored in a format that allows subsequent retrieval given an input query.
A. Chunking
Chunking is the process of breaking down the knowledge base into roughly equal-sized partitions, or “chunks”, of text (a minimal sketch follows the list below). This is done for two reasons:
- LLMs have a limited context window, so your system can only afford to provide the information that is relevant to the query.
- LLMs also suffer from the “lost in the middle” problem, where performance degrades when they are given a lot of context. So even if your LLM had an infinite context window, current architectures would still struggle, and it always helps to be concise.
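A minimal chunking sketch, using fixed-size character windows with a small overlap (the sizes are arbitrary assumptions, not recommendations):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into roughly equal-sized, overlapping character chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():          # skip empty/whitespace-only tails
            chunks.append(chunk)
    return chunks

document = "..."  # contents of one parsed document
chunks = chunk_text(document)
```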
B. Sanitisation
Parsed documents contain a lot of unwanted characters that do not offer any real value. Sanitising the text after chunking helps optimise the cost of running the system, since the number of input tokens sent to the LLM will be considerably lower.
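A small sanitisation pass might look like this; exactly which characters are worth stripping depends on how your documents were parsed. It reuses chunks from the chunking sketch above:

```python
import re

def sanitise(chunk: str) -> str:
    """Remove characters that add tokens but no meaning."""
    chunk = re.sub(r"[ \t]+", " ", chunk)          # collapse runs of spaces/tabs
    chunk = re.sub(r"\n{3,}", "\n\n", chunk)       # collapse excessive blank lines
    chunk = re.sub(r"[^\x09\x0A\x20-\x7E\u00A0-\uFFFF]", "", chunk)  # drop stray control chars
    return chunk.strip()

clean_chunks = [sanitise(c) for c in chunks]
```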
2. Information Retrieval
A. Embeddings and Approximate Nearest Neighbour search
Okay. You’ve probably heard about this one. It was introduced in the original RAG paper and hence is now the most popular way of implementing a retrieval system: every chunk is embedded into a dense vector, the query is embedded with the same model, and the chunks whose vectors lie nearest to the query vector (found via an approximate nearest-neighbour search) are retrieved.
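A minimal sketch, assuming the sentence-transformers library for embeddings and a brute-force cosine-similarity search standing in for a true approximate-nearest-neighbour index (at scale you would replace the NumPy search with an ANN library such as FAISS or hnswlib). It reuses clean_chunks from the sanitisation sketch:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

# index time: embed every chunk once and keep the matrix around
chunk_vectors = model.encode(clean_chunks, normalize_embeddings=True)

def ann_search(query: str, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector          # cosine similarity (vectors are normalised)
    top_k = np.argsort(scores)[::-1][:k]
    return [clean_chunks[i] for i in top_k]
```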
B. Keyword search
Keyword search involves extracting keywords from a user query and searching for fuzzy/exact matches in the knowledge base. Keyword search can be combined with ANN search in what is widely known as hybrid search. It is a naive form of tool calling.
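A naive keyword scorer, plus one possible way to blend it with the embedding scores above into a hybrid ranking. The 0.5/0.5 weighting is an arbitrary assumption, and a BM25 implementation such as rank_bm25 is a common substitute for the keyword side; model, chunk_vectors and clean_chunks carry over from the previous sketch:

```python
import numpy as np

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear verbatim in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)

def hybrid_search(query: str, k: int = 3) -> list[str]:
    """Blend dense (embedding) and sparse (keyword) relevance scores."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    dense_scores = chunk_vectors @ query_vector
    combined = [
        0.5 * dense_scores[i] + 0.5 * keyword_score(query, chunk)
        for i, chunk in enumerate(clean_chunks)
    ]
    top_k = np.argsort(combined)[::-1][:k]
    return [clean_chunks[i] for i in top_k]
```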
3. Inference Parameter Tuning
Inference parameters are the parameters you send in every API call to the model provider’s LLM completion endpoint. You can have a look at an example here. There are many inference parameters you will come across, but for a naive RAG chatbot you would typically have to look at:
- model
- temperature
- top-P
- max_output_tokens
- instructions: the ‘system’ message
The exact inference parameters depend on the model provider’s client API and on the architecture of the underlying models.
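As a concrete sketch, here is roughly what such a call looks like against the OpenAI Responses API; the model name, values, and the example query are illustrative assumptions, and parameter names differ between providers and endpoints (e.g. max_tokens vs max_output_tokens). It reuses hybrid_search from the retrieval sketch above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_query = "What does the refund policy say about late returns?"  # example query
context = "\n\n".join(hybrid_search(user_query))

response = client.responses.create(
    model="gpt-4.1-mini",                 # model
    temperature=0.2,                      # low randomness for grounded answers
    top_p=0.9,                            # nucleus sampling cutoff
    max_output_tokens=512,                # cap on generated tokens
    instructions=(                        # the 'system' message
        "Answer using only the provided context. "
        "Say you don't know if the context is insufficient."
    ),
    input=f"Context:\n{context}\n\nQuestion: {user_query}",
)
print(response.output_text)
```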
4. Evaluation
let’s deploy :P
Getting a RAG system like this to production depends on a lot of factors, and is primarily driven by business stakeholders. A production-oriented analysis of vector databases is here.
In production
model provider API